Abstract The hierarchical Dirichlet process (HDP) has become an important Bayesian nonparametric model for grouped
data, such as document collections. The HDP is used to con- 
struct a flexible mixed-membership model where the num- 
ber of components is determined by the data. As for most 
Bayesian nonparametric models, exact posterior inference 
is intractable — practitioners use Markov chain Monte Carlo 
(MCMC) or variational inference. Inspired by the split-merge 
MCMC algorithm for the Dirichlet process (DP) mixture 
model, we describe a novel split-merge MCMC sampling 
algorithm for posterior inference in the HDP. We study its 
properties on both synthetic data and text corpora. We find 
that split-merge MCMC for the HDP can provide significant 
improvements over traditional Gibbs sampling, and we give 
some understanding of the data properties that give rise to 
larger improvements. 

Keywords hierarchical Dirichlet process • Markov chain 
Monte Carlo • split and merge 

1 INTRODUCTION 

The hierarchical Dirichlet process (HDP) [14] has become an important tool for the unsupervised analysis of grouped data [16], for example, image retrieval and object recognition [?], multi-population haplotype phasing [18], time series modeling [7] and Bayesian nonparametric topic modeling [14]. Specifically, topic modeling is the scenario in which the HDP is applied to document collections, and each document is considered to be a group of observed words.¹ This is an extension of latent Dirichlet allocation (LDA) [4] that allows a potentially unbounded number of topics (i.e., mixture components). Given a collection of documents, posterior inference for the HDP determines the number of topics from the data.

As for most Bayesian nonparametric models, however, 
exact posterior inference is intractable. Practitioners must 
resort to approximate inference methods, such as Markov 
chain Monte Carlo (MCMC) sampling [14] and variational
inference [15]. The idea behind both of these methods is to
form an approximate posterior distribution over the latent 
variables that is used as a proxy for the true posterior. 

We will focus on MCMC sampling, where the approxi- 
mate posterior is formed as an empirical distribution of sam- 
ples from a Markov chain whose stationary distribution is 
the posterior of interest. The central MCMC algorithm for the HDP is an incremental Gibbs sampler [14, 8], which may be slow to mix (i.e., the chain must be run for many iterations before reaching its stationary distribution). For example, in topic modeling, where each word of each document is assigned to a topic, the incremental Gibbs sampler only allows changing the topic status of one observed word at a time. This precludes large changes in the latent structure.² Our goal is to improve Gibbs sampling for the HDP.

We develop and study a split-merge MCMC algorithm for the HDP. Our approach is inspired by the success of split-merge MCMC samplers for Dirichlet process (DP) mixtures [9, 5].³ The DP mixture [2] is an "infinite clustering model" where each observation is associated with a single component. (In contrast, the HDP is a mixed-membership model.) In split-merge inference for DP mixtures, the Gibbs sampler is augmented with split-merge operations. Two observations are picked at random. If the observations are in the same component then a split is proposed: all the observations associated with that component are divided into two new components. If the observations are in different components then a merge is proposed: the observations from the two components are placed in the same component. Finally, whether the resulting split or merged state is accepted is determined by a Metropolis-Hastings step. As demonstrated in [9, 5], split-merge MCMC is effective for DP mixtures when the mixture models have overlapping clusters.

1 We focus on topic modeling in this paper. We will use "HDP" and "HDP topic model" interchangeably.

2 In [14], the Gibbs sampling based on the Chinese restaurant franchise (CRF) representation does allow the possibility of changing the status of some words in a document together. However, as stated in [14], this is a prior clustering effect and does not have practical advantages.

Our split-merge MCMC algorithm for the HDP is based on the Chinese restaurant franchise (CRF) representation of a two-level HDP [14], where "customers" are partitioned at the group level and "dishes" are partitioned at the top level. In an HDP topic model, the customer partition represents the per-document partition of words; the top-level partition represents the sharing of topics between documents. The split-merge algorithm for HDPs operates at the top level; thus, the assignment of subsets of documents across the corpus may be split or merged with other subsets. (The reason we do not apply split-merge operations to the lower-level DP is detailed in Section 3.) We first demonstrate our algorithm on
synthetic data, and then study its performance on three real- 
world corpora. We see that our split-merge MCMC algo- 
rithm can provide significant improvements over traditional 
Gibbs sampling, and we give some understanding of the data 
properties that give rise to larger improvements. 



2 THE HDP TOPIC MODEL 

The hierarchical Dirichlet process [14] is a hierarchical generalization of the Dirichlet process (DP) distribution on random distributions [6]. We will focus on a two-level HDP, which can be used in an infinite-capacity mixed-membership model. In a mixed-membership model, data are groups of observations, and each group exhibits a shared set of mixture components with different proportions. We will further focus on text-based topic modeling, the HDP topic model. In this setting, the data are observed words from a vocabulary grouped into documents; the mixture components are distributions over terms called "topics."

In an HDP mixed-membership model, each group is associated with a draw from a shared DP whose base distribution is also a draw from a DP,

$$G_0 \sim \mathrm{DP}(\gamma, H)$$
$$G_j \mid G_0 \sim \mathrm{DP}(\alpha_0, G_0), \quad \text{for each } j,$$


3 Here we focus on the conjugate models. The non-conjugate version of split-merge MCMC samplers for DP mixtures is presented in [10].



where $j$ is a group index. At the top level, the distribution $G_0$ is a draw from a DP with concentration parameter $\gamma$ and base distribution $H$. It is almost surely discrete, placing its mass on atoms drawn independently from $H$ [6]. At the bottom level, this discrete distribution is used as the base distribution for each per-group distribution $G_j$. Though each $G_j$ may be defined on a continuous space (e.g., the simplex), this ensures that the per-group distributions $G_j$ share the same atoms as $G_0$.

In topic modeling, each group is a document of words and the atoms are distributions over words (topics). The base distribution $H$ is usually chosen to be a symmetric Dirichlet over the vocabulary simplex, i.e., the atoms $\phi = (\phi_k)_{k=1}^{\infty}$ are drawn independently, $\phi_k \sim \mathrm{Dirichlet}(\eta)$. To complete the HDP topic model, we draw the $i$th word in the $j$th document, $x_{ji}$, as follows,

$$\theta_{ji} \sim G_j, \qquad x_{ji} \sim \mathrm{Mult}(\theta_{ji}). \tag{1}$$

We will show how $\theta_{ji}$ is related to $\phi$ in the next section. The clustering effect of the Dirichlet process ensures that this yields a mixed-membership model, where the topics are shared among documents but each document exhibits them with different proportions. Based on this important property, we now turn to an alternative representation of the HDP.



2.1 The Chinese Restaurant Franchise

Consider a random distribution $G$ drawn from a DP and a set of variables drawn from $G$. Integrating out $G$, these variables exhibit a clustering effect: they can be grouped according to which take on the same value $\theta$. The values associated with each group are independent draws from the base distribution. The distribution of the partition is a Chinese restaurant process (CRP) [?].

In the HDP, the two levels of DPs enforce two kinds of grouping of the observations. First, words within a document are grouped according to those drawn from the same "unique" atom in $G_j$. (By "unique," we mean atoms drawn independently from $G_0$.) Second, the grouped words in each document are themselves again grouped according to those associated with the same atom in $G_0$. Note that this corpus-level partitioning connects groups of words from different documents. (And, it may connect two groups of words from the same document.) These partitions (the grouping of words within a document and the grouping of word-groups within the corpus) are each governed by a CRP.

As a consequence of this construction, the atoms of $G_0$, i.e., the population of topics, are shared by different documents. And, because of the clustering of words, each document individually may exhibit several of those topics. This is the key property of the HDP.
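The clustering effect of a single CRP can be illustrated with a short simulation. The following is a minimal sketch, not the authors' code: the function name `sample_crp_partition` and its arguments are hypothetical, and it seats customers sequentially, choosing an existing table with weight equal to its current size and a new table with weight equal to the concentration parameter.

```python
import random

def sample_crp_partition(n_customers, concentration, rng):
    """Seat customers one at a time; return each customer's table index and the table sizes."""
    table_sizes = []   # table_sizes[t] = number of customers currently at table t
    assignments = []
    for _ in range(n_customers):
        # An existing table t has weight table_sizes[t]; a new table has weight `concentration`.
        weights = table_sizes + [concentration]
        t = rng.choices(range(len(weights)), weights=weights)[0]
        if t == len(table_sizes):
            table_sizes.append(1)    # open a new table
        else:
            table_sizes[t] += 1      # join an existing table
        assignments.append(t)
    return assignments, table_sizes

rng = random.Random(0)
assignments, table_sizes = sample_crp_partition(50, 1.0, rng)
print(len(table_sizes), "tables with sizes", table_sizes)
```

In the HDP, one such process would run per document (words to tables) and another over all tables in the corpus (tables to topics).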

The split-merge Gibbs sampler that we develop below relies on a representation of the HDP based on these partition probabilities, rather than on the random distributions.⁴ This representation is known as the Chinese restaurant franchise (CRF) [14], a hierarchy of CRPs. At the document level, words in a document are grouped into "tables" according to a CRP for that document; at the corpus level, tables are grouped into "dishes" according to a corpus-level CRP. All the words that are attached (via their table) to the same dish are drawn from the same topic. This is illustrated in Figure 1.

Fig. 1 A depiction of a CRF, adapted from [14]. Each restaurant/document is represented by a rectangle. Customer/word $x_{ji}$ is seated at a table (circles) in restaurant/document $j$ via the customer-specific table index $t_{ji}$. Each table has a dish/topic, one of the global dishes/topics ($\phi_k$'s), indicated by the table-specific dish/topic index, $k_{jt}$.

notation                 description
$n_{jtk}$                # words in document $j$ at table $t$ and topic $k$
$n_{jt\cdot}$            # words in document $j$ at table $t$
$n_{j\cdot k}$           # words in document $j$ belonging to topic $k$
$n_{j\cdot\cdot}$        # words in document $j$
$n_{\cdot\cdot k}$       # words belonging to topic $k$ in the corpus
$m_{jk}$                 # tables in document $j$ belonging to topic $k$
$m_{j\cdot}$             # tables in document $j$
$m_{\cdot k}$            # tables belonging to topic $k$ in the corpus
$m_{\cdot\cdot}$         total tables in the corpus

Table 1 Notation used in the CRF representation

We now give the generative process for an HDP topic model based on the CRF representation, which is important for developing the split-merge MCMC algorithm. Let $t_{ji}$ denote the table index for $x_{ji}$, the $i$th word in document $j$, and let $k_{jt}$ denote the dish index, i.e., the global topic index, for table $t$ in document $j$. The model probabilities are based on several types of counts built from these fundamental elements. Notation is tabulated in Table 1.

There are three steps to the process. First, we generate ta- 
ble indices for each word in each document — this partitions 
the words within the documents. Then, we generate topic 

4 In the HDP in general, Gibbs samplers tend to operate on parti- 
tions; variational methods tend to operate on constructions of the ran- 
dom distributions. 



indices for each table in each document — this partitions the 
tables (which are groups of words) according to topics. Fi- 
nally, for each word we generate its type (e.g., "house" or 
"train") from the assigned topic of its assigned table. 



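Generate the table index $t_{ji}$. In the document-level CRP for document $j$, table index $t_{ji}$ is generated sequentially according to

$$p(t_{ji} = t \mid t_{j1}, \ldots, t_{j,i-1}, \alpha_0) \propto \begin{cases} n_{jt\cdot}, & \text{if } t = 1, \ldots, m_{j\cdot}, \\ \alpha_0, & \text{if } t = t^{\mathrm{new}}. \end{cases}$$

This induces the probability of partitioning the words of document $j$ into tables,

$$p(t_j) = \frac{\alpha_0^{m_{j\cdot}} \prod_{t=1}^{m_{j\cdot}} (n_{jt\cdot} - 1)!}{\prod_{i=1}^{n_{j\cdot\cdot}} (i + \alpha_0 - 1)}. \tag{2}$$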
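The partition probability in Eq. 2 is convenient to evaluate in log space. The sketch below is an illustration under the assumption that Eq. 2 has the standard CRP form (concentration raised to the number of tables, times factorials of table sizes minus one, over a rising product); the function name `crp_log_partition_prob` is hypothetical.

```python
import math

def crp_log_partition_prob(table_sizes, concentration):
    """Log probability of a CRP partition with the given table sizes."""
    n = sum(table_sizes)
    log_p = len(table_sizes) * math.log(concentration)
    # lgamma(size) = log((size - 1)!) for a positive integer size
    log_p += sum(math.lgamma(size) for size in table_sizes)
    # Denominator: product over i = 1..n of (i + concentration - 1)
    log_p -= sum(math.log(i + concentration - 1.0) for i in range(1, n + 1))
    return log_p

# The two possible partitions of two customers should have total probability 1.
alpha0 = 0.7
total = (math.exp(crp_log_partition_prob([2], alpha0))
         + math.exp(crp_log_partition_prob([1, 1], alpha0)))
print(total)
```

The same function applies to the corpus-level partition of tables into topics by passing the number of tables per topic and the top-level concentration parameter.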



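Generate the topic index $k_{jt}$. After all words in all documents are partitioned into tables, we generate the topic index $k_{jt}$ for each table in each document. As above, the topic index comes from the corpus-level CRP,

$$p(k_{jt} = k \mid k_{11}, k_{12}, \ldots, k_{21}, \ldots, k_{j,t-1}, \gamma) \propto \begin{cases} m_{\cdot k}, & \text{if } k = 1, \ldots, K, \\ \gamma, & \text{if } k = k^{\mathrm{new}}, \end{cases} \tag{3}$$

which induces the probability for the partition of all the tables. Below, $D$ is the number of documents and $K$ is the number of used topics in the set of topic indices,

$$p(\mathbf{k}) = \frac{\gamma^{K} \prod_{k=1}^{K} (m_{\cdot k} - 1)!}{\prod_{s=1}^{m_{\cdot\cdot}} (s + \gamma - 1)}, \qquad m_{\cdot\cdot} = \sum_{j=1}^{D} m_{j\cdot}. \tag{4}$$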



Let $c$ denote the complete collection of table and topic indices, $\mathbf{t} = (t_j)_{j=1}^{D}$ and $\mathbf{k}$. We combine Eqs. 2 and 4,

$$p(c) = p(\mathbf{k})\, p(\mathbf{t}) = p(\mathbf{k}) \prod_{j=1}^{D} p(t_j). \tag{5}$$

Generate word observations $x_{ji}$. Finally, we generate the observed words $x_{ji}$. In the previous representation of the HDP in Eq. 1, the words were generated given a topic $\theta_{ji}$. In this representation, each $x_{ji}$ is associated with a table index $t_{ji}$, and each table is associated with a topic index $k_{jt}$, which links to one of the topics $\phi_k$. Define $z_{ji} = k_{j t_{ji}}$,

$$\theta_{ji} = \phi_{z_{ji}}, \qquad x_{ji} \sim \mathrm{Mult}(\theta_{ji}).$$

We call $z_{ji}$ the topic index for word $x_{ji}$. It locates the topic $\phi_{z_{ji}}$ from which $x_{ji}$ is generated.

Since different words with the same value of $z_{ji}$ are drawn from the same topic $\phi_{z_{ji}}$, we can consider the conditional likelihood of the corpus $\mathbf{x} = (x_j)_{j=1}^{D}$ given all the latent indices $c$. We are integrating out the topics $\phi$. (Recall $\phi_k \sim \mathrm{Dirichlet}(\eta)$.) This conditional likelihood is

$$p(\mathbf{x} \mid c) = \prod_{k} f_k(\{x_{ji} : z_{ji} = k\}), \tag{6}$$

where

$$f_k(\{x_{ji} : z_{ji} = k\}) = \frac{\Gamma(V\eta)}{\Gamma(n_{\cdot\cdot k} + V\eta)} \cdot \frac{\prod_{v} \Gamma(n_{\cdot\cdot k}^{v} + \eta)}{\Gamma^{V}(\eta)}.$$

The size of the vocabulary is $V$ and the number of occurrences of word $v$ assigned to topic $k$ is $n_{\cdot\cdot k}^{v}$. This completes the generative process for an HDP topic model.
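The per-topic term in the conditional likelihood is a Dirichlet-multinomial marginal, which can be computed from word counts alone. The following is a minimal sketch under that reading; `log_topic_marginal` is a hypothetical name, and `word_counts[v]` is assumed to hold the number of occurrences of word v assigned to the topic.

```python
import math

def log_topic_marginal(word_counts, eta):
    """Log marginal likelihood of the words in one topic, with the topic integrated out."""
    V = len(word_counts)          # vocabulary size
    n = sum(word_counts)          # total words assigned to the topic
    log_f = math.lgamma(V * eta) - math.lgamma(n + V * eta)
    log_f += sum(math.lgamma(c + eta) for c in word_counts)
    log_f -= V * math.lgamma(eta)
    return log_f

# Two-word vocabulary, eta = 1: observing one word of each type.
print(math.exp(log_topic_marginal([1, 1], 1.0)))
```

Summing this quantity over all used topics gives the log of the conditional likelihood in Eq. 6.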



3 SPLIT-MERGE MCMC FOR THE HDP 

Given a collection of documents, the goal of posterior inference is to compute the conditional distribution of the latent structure, the assignment of documents to topics and the distributions over words associated with each topic. In [14], posterior inference is based on the CRF representation presented above. Their Gibbs sampler iteratively samples the table indices $\mathbf{t}$ and topic indices $\mathbf{k}$. Details are in [14].

We now develop split-merge MCMC for the HDP. Our motivation is that incremental Gibbs samplers can be slow to converge, as they only sample one variable at a time. Split-merge algorithms can consider larger moves in the state space and have the potential to converge more quickly. To construct the split-merge MCMC algorithm for the HDP, we first recall that, from Eqs. 3 and 4, the top level of the HDP can be described as the corpus-level CRP with tables from all documents as observations. The idea behind our algorithm is to use the split-merge MCMC algorithm for DP mixtures [9, 5] at this top level.

To make the above concrete, we first review the tradi- 
tional Gibbs sampling algorithm for the topic indices k and 
then present the split-merge MCMC algorithm. At the end 
of this section, we discuss why we only use split-merge op- 
erations on the top level of the HDP. 



3.1 Gibbs sampling for topic indices k 

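Since all tables in the corpus are partitioned into topics according to the corpus-level CRP (see Eqs. 3 and 4), we follow the standard procedure (Algorithm 3 in Neal's paper [13]) to derive their Gibbs sampling updates. Let $f_k^{-x_{jt}}(x_{jt})$ denote the conditional density of $x_{jt}$ (all the words at table $t$ in document $j$) given all words in topic $k$, excluding $x_{jt}$,

$$f_k^{-x_{jt}}(x_{jt}) = \frac{\Gamma(n_{\cdot\cdot k}^{-x_{jt}} + V\eta)}{\Gamma(n_{\cdot\cdot k}^{-x_{jt}} + n^{x_{jt}} + V\eta)} \cdot \frac{\prod_{v} \Gamma(n_{\cdot\cdot k}^{-x_{jt},v} + n^{x_{jt},v} + \eta)}{\prod_{v} \Gamma(n_{\cdot\cdot k}^{-x_{jt},v} + \eta)}, \tag{7}$$

where $n_{\cdot\cdot k}^{-x_{jt},v}$ is the number of occurrences of word $v$ with topic $k$, excluding $x_{jt}$; $n^{x_{jt},v}$ is the number of occurrences of word $v$ in $x_{jt}$; and $n_{\cdot\cdot k}^{-x_{jt}}$ and $n^{x_{jt}}$ are the corresponding totals over $v$. See the appendix for the relevant derivations, which are the same as in [14].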

Because of the exchangeability of the CRP, we can view the table indexed by $k_{jt}$ as the last table. Thus, by combining Eqs. 3 and 7, we obtain the Gibbs sampling algorithm for $k_{jt}$ [13],

$$p(k_{jt} = k \mid \mathbf{t}, \mathbf{k}^{-jt}, \mathbf{x}) \propto \begin{cases} m_{\cdot k}^{-jt}\, f_k^{-x_{jt}}(x_{jt}), & \text{if } k = 1, \ldots, K, \\ \gamma\, f_{k^{\mathrm{new}}}^{-x_{jt}}(x_{jt}), & \text{if } k = k^{\mathrm{new}}. \end{cases}$$

Note that changing $k_{jt}$ changes the topic of all the words in $x_{jt}$ together. However, as discussed in [14], assigning words to different tables with the same $k$ is a prior clustering effect of a DP with $n_{j\cdot k}$ customers (and only happens in one document). Reassigning $k_{jt}$ to other topics is unlikely.
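The Gibbs update for a table's topic index can be sketched in a few lines. This is an illustration only, assuming the conditional density in Eq. 7 is the usual ratio of Dirichlet-multinomial normalizers; the names `log_table_predictive` and `topic_conditional` are hypothetical, all counts are assumed to already exclude the table being resampled, and every existing topic is assumed to retain at least one table.

```python
import math

def log_table_predictive(topic_counts, table_counts, eta):
    """Log density of a table's words given the other words already in a topic."""
    V = len(topic_counts)
    n_topic, n_table = sum(topic_counts), sum(table_counts)
    out = math.lgamma(n_topic + V * eta) - math.lgamma(n_topic + n_table + V * eta)
    for c_topic, c_table in zip(topic_counts, table_counts):
        out += math.lgamma(c_topic + c_table + eta) - math.lgamma(c_topic + eta)
    return out

def topic_conditional(tables_per_topic, counts_per_topic, table_counts, gamma, eta):
    """Normalized probabilities over existing topics, with a new topic as the last entry."""
    V = len(table_counts)
    log_w = [math.log(m) + log_table_predictive(c, table_counts, eta)
             for m, c in zip(tables_per_topic, counts_per_topic)]
    # A new topic has no words yet, so its predictive is the prior marginal.
    log_w.append(math.log(gamma) + log_table_predictive([0] * V, table_counts, eta))
    top = max(log_w)                      # subtract the max for numerical stability
    w = [math.exp(x - top) for x in log_w]
    total = sum(w)
    return [x / total for x in w]

probs = topic_conditional([2, 2], [[10, 0, 0], [0, 10, 0]], [3, 0, 0], gamma=1.0, eta=0.5)
print(probs)
```

In this toy call the table contains three copies of word 0, so the first topic, which already holds that word, receives most of the probability mass.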



3.2 A split-merge algorithm for topic indices k 

The split-merge MCMC algorithm for DP mixture models [9, 5] starts by randomly choosing two observations. If they are in the same component then a split is proposed, where all the observations in this component are assigned to two new components (and the old component is removed). If they are in two different components then a merge is proposed, where all the observations in the two components are merged into one new component (and the two old components are removed). Whether the proposed split or merge is accepted is determined by the Metropolis-Hastings ratio.



Based on the discussion in Section 3.1, the split-merge MCMC algorithm for DP mixture models can be used at the top level of the HDP by viewing the tables as the observations for the corpus-level CRP. We present our algorithm using the sequential allocation approach proposed in [5]. (This is easy to modify to use the intermediate scans used in [9].) Thus, the sketch of this algorithm is similar to the one presented in [5].

We describe the procedure when a split is proposed. Two 
tables have been selected and their assigned topics are the 
same — this is the selected topic. We then create two new 
topics with each containing one of the tables just selected. 
Finally, we consider all the other tables in the corpus as- 
signed to the selected topic, and assign those tables into the 
two new topics. Following [5], this is done by running a
"mini one-pass" Gibbs sampler over only the two new top- 
ics, and partitioning each table into one or other. We call the 
new state containing the two new topics the split state. 

In more detail, let $c$ be the current state and $c_{\mathrm{split}}$ be the split state. We use $(j,t)$ to indicate the table $t$ in document $j$. Then the two selected tables are represented as $(j_1,t_1)$ and $(j_2,t_2)$. In state $c$, the selected topic is $k = k_{j_1 t_1} = k_{j_2 t_2}$. Further, let $S_c$ be the set of tables whose topic is $k$, excluding tables $(j_1,t_1)$ and $(j_2,t_2)$,

$$S_c = \{(j,t) : (j,t) \neq (j_1,t_1),\ (j,t) \neq (j_2,t_2),\ k_{jt} = k\}.$$

In state $c_{\mathrm{split}}$, in order to replace topic $k$, we create two new topics, $k_1$ and $k_2$, then assign $k_{j_1 t_1} = k_1$ and $k_{j_2 t_2} = k_2$ for the two initially selected tables. Let us now show how we use sequential allocation restricted Gibbs sampling (the "mini one-pass Gibbs sampler" we just mentioned) [5] to reach split state $c_{\mathrm{split}}$ from state $c$.
Sequential allocation restricted Gibbs sampling. Define $S_1 = \{(j_1,t_1)\}$ and $S_2 = \{(j_2,t_2)\}$ and recall that $k_{j_1 t_1} = k_1$ and $k_{j_2 t_2} = k_2$. Let $m_{\cdot k_1} = |S_1|$ and $m_{\cdot k_2} = |S_2|$ be the number of tables in $S_1$ and $S_2$. We will use set $S_1$ or $S_2$ to receive the tables from $S_c$ that will be assigned to $k_1$ or $k_2$ in the sequential restricted Gibbs sampling procedure. Let $(j,t)$ be successive table indexes in a uniformly permuted $S_c$ and sample $k_{jt}$ according to the following,

$$p(k_{jt} = k_\ell \mid S_1, S_2) \propto m_{\cdot k_\ell}\, f_{k_\ell}^{-x_{jt}}(x_{jt}), \quad \ell = 1, 2.$$

This is a one-pass Gibbs sampling of $k_{jt}$, but restricted to only topics $k_1$ and $k_2$; it is called sequential allocation restricted Gibbs sampling in [5]. We have

If $k_{jt} = k_1$, then $S_1 \leftarrow S_1 \cup \{(j,t)\}$, $m_{\cdot k_1} \leftarrow m_{\cdot k_1} + 1$;
otherwise, $S_2 \leftarrow S_2 \cup \{(j,t)\}$, $m_{\cdot k_2} \leftarrow m_{\cdot k_2} + 1$.

Repeat this until all tables in $S_c$ are visited. The final $S_1$ and $S_2$ will contain all the tables that are assigned to $k_1$ and $k_2$ in state $c_{\mathrm{split}}$. Let the realization of $k_{jt}$ be $k_{jt}^{(r)}$; then we can compute the transition probability from state $c$ to split state $c_{\mathrm{split}}$ as

$$q(c \to c_{\mathrm{split}}) = \prod_{(j,t) \in S_c} p(k_{jt} = k_{jt}^{(r)} \mid S_1, S_2).$$

Note that we have abused notation slightly, since $S_1$ and $S_2$ are changing when we sample $k_{jt}$. The transition probability from split state $c_{\mathrm{split}}$ to state $c$ is

$$q(c_{\mathrm{split}} \to c) = 1.$$

This is because there is only one way to merge two topics $k_1$ and $k_2$ in state $c_{\mathrm{split}}$ into topic $k$ in state $c$.
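The sequential allocation step can be sketched as follows. This is a simplified illustration rather than the authors' implementation: each table is represented only by its word-count vector, the function names are hypothetical, and the optional `forced` argument (an assumption of this sketch) replays a fixed allocation so that the same routine can return the proposal probability needed when evaluating a merge.

```python
import math
import random

def log_predictive(topic_counts, table_counts, eta):
    """Log density of a table's words given the words already in a topic."""
    V = len(topic_counts)
    n_topic, n_table = sum(topic_counts), sum(table_counts)
    out = math.lgamma(n_topic + V * eta) - math.lgamma(n_topic + n_table + V * eta)
    for c_topic, c_table in zip(topic_counts, table_counts):
        out += math.lgamma(c_topic + c_table + eta) - math.lgamma(c_topic + eta)
    return out

def sequential_allocation(anchor1, anchor2, others, eta, rng, forced=None):
    """Allocate the tables in `others` to two topics seeded by the two anchor tables.

    Returns (labels, log_q), where log_q is the log of the product of the
    probabilities of the allocations that were made.
    """
    counts = [list(anchor1), list(anchor2)]   # word counts of the two new topics
    n_tables = [1, 1]                         # each topic starts with its anchor table
    order = list(range(len(others)))
    rng.shuffle(order)                        # uniformly permuted visiting order
    labels = [None] * len(others)
    log_q = 0.0
    for idx in order:
        table = others[idx]
        log_w = [math.log(n_tables[l]) + log_predictive(counts[l], table, eta)
                 for l in (0, 1)]
        top = max(log_w)
        w = [math.exp(x - top) for x in log_w]
        p = [w[0] / (w[0] + w[1]), w[1] / (w[0] + w[1])]
        if forced is not None:
            label = forced[idx]               # replay a given allocation
        else:
            label = 0 if rng.random() < p[0] else 1
        log_q += math.log(p[label]) if p[label] > 0.0 else float("-inf")
        labels[idx] = label
        n_tables[label] += 1
        counts[label] = [a + b for a, b in zip(counts[label], table)]
    return labels, log_q

tables = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [0, 2, 1]]
labels, log_q = sequential_allocation([4, 0, 0], [0, 4, 0], tables, 0.5, random.Random(1))
print(labels, log_q)
```

Because the two topics grow as tables are added, later allocations depend on earlier ones, which is why the visiting order is randomized.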

Now we decide whether to retain the new partition, with the split topics in place of the original topic, or whether to return to the partition before the split. This decision is made with a Metropolis-Hastings step. The acceptance ratio for the split is

$$\mathcal{A} = \frac{p(c_{\mathrm{split}})}{p(c)} \cdot \frac{L(c_{\mathrm{split}})}{L(c)} \cdot \frac{q(c_{\mathrm{split}} \to c)}{q(c \to c_{\mathrm{split}})},$$

and the split is accepted with probability $\min(1, \mathcal{A})$. We obtain the prior ratio from Eq. 5,

$$\frac{p(c_{\mathrm{split}})}{p(c)} = \gamma\, \frac{(m_{\cdot k_1} - 1)!\,(m_{\cdot k_2} - 1)!}{(m_{\cdot k} - 1)!}.$$

Define $L(c) = p(\mathbf{x} \mid c)$ as in Eq. 6. The likelihood ratio is

$$\frac{L(c_{\mathrm{split}})}{L(c)} = \frac{f_{k_1}(\{x_{ji} : z_{ji} = k_1\})\, f_{k_2}(\{x_{ji} : z_{ji} = k_2\})}{f_k(\{x_{ji} : z_{ji} = k\})}.$$

The validity of the MCMC proposal can be verified by following [9, 5]. The complete algorithm is described in Figure 2, which also contains the merge proposal, i.e., the case where the two initially selected tables are attached to different topics.
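The acceptance computation for a split is easiest to carry out in log space. The sketch below assumes the prior ratio, likelihood ratio and proposal probabilities take the forms given in this section; the function names and argument names are hypothetical, and the per-topic log marginal likelihoods and the log proposal probability are assumed to be computed elsewhere.

```python
import math
import random

def log_split_acceptance(m_k1, m_k2, gamma, log_f_k1, log_f_k2, log_f_k, log_q_split):
    """Log Metropolis-Hastings ratio for splitting topic k into k1 and k2.

    m_k1, m_k2: number of tables in the two proposed topics (m_k = m_k1 + m_k2).
    log_f_*:    log marginal likelihoods of the words in each topic.
    log_q_split: log probability of the sequential allocation that was proposed.
    """
    # gamma * (m_k1 - 1)! * (m_k2 - 1)! / (m_k - 1)!, using lgamma(n) = log((n - 1)!)
    log_prior_ratio = (math.log(gamma) + math.lgamma(m_k1) + math.lgamma(m_k2)
                       - math.lgamma(m_k1 + m_k2))
    log_lik_ratio = log_f_k1 + log_f_k2 - log_f_k
    # The reverse (merge) proposal has probability 1, so only the split proposal enters.
    return log_prior_ratio + log_lik_ratio - log_q_split

def accept(log_ratio, rng):
    """Accept with probability min(1, exp(log_ratio))."""
    u = 1.0 - rng.random()        # uniform on (0, 1], so log(u) is always defined
    return math.log(u) < log_ratio

log_a = log_split_acceptance(1, 1, 2.0, 0.0, 0.0, 0.0, 0.0)
print(math.exp(log_a), accept(log_a, random.Random(0)))
```

Under the same assumptions, the log ratio for a merge is the negative of the split ratio evaluated for the corresponding pair of topics, since the two moves are reverses of each other.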

Discussion. There are two other possible ways to introduce split-merge operations in the HDP. First, we can split and merge within the document-level DP. This is of little interest because we care more about the words' global topic assignments than their local table assignments. Second, we can split and merge by choosing two words and resampling all the other words in the same topic(s). However, this will be inefficient. Two different words have zero similarity, and so two randomly picked words provide little guidance and could lead to a very low acceptance ratio. (This is different from a DP mixture model applied to continuous data, where different data points can have different similarities/distances.) In contrast, tables can be seen as word vectors with many non-zero entries, which mitigates this issue.



4 EXPERIMENTS 

We studied split-merge MCMC for the HDP topic model 
on synthetic and real data. To initialize the sampler, we use 
sequential prediction — we iteratively assign words to a ta- 
ble and a topic according to the predictive distribution given 
the previously seen data and the algorithm proceeds until 
all words are "added" into the model. This works well empirically and was used in [12] for DP mixture models. In addition, we use multiple random starts. Our C++ code is available as a general software tool for fitting HDP topic models with split-merge and traditional Gibbs algorithms at

http://www.cs.princeton.edu/~chongw/software/hdp.tar.gz



4.1 Synthetic Data 

We use synthetic text data to give an understanding of how 
the algorithm works. We generated 100 documents, each 
with 50 words, from a model with 5 topics. There are 12 
words in the vocabulary and each document uses at most 2 
topics. The topic multinomial distributions over the words 
are shown in Figure 3. In this model, topics 1 and 2 are very similar: they share 7 words (words 3 to 9) with the same highest probability. Topic 1 places high probability on word 1 but low probability on word 2; topic 2 is reversed. Other topics are less similar to each other: topics 3-5 share no words with topics 1 and 2 and have different distributions over the remaining words. Thus, we expect that topics 1 and 2 are difficult to distinguish; identifying the rest should be easier. Our goal is to demonstrate that without split-merge operations, it is difficult for the traditional Gibbs sampler to separate topics 1 and 2.

For the HDP topic models, we set the topic Dirichlet parameter $\eta = 0.5$ and place Gamma(0.1, 1) priors on the hyperparameters $\gamma$ and $\alpha_0$ to favor sparsity. (Without sampling $\gamma$ and $\alpha_0$, the results are similar.) For this experiment, we run one split-merge trial after each Gibbs sweep. We run the algorithms for 1000 iterations.

Figure 4 shows the results. In Figure 4(a), we compare
the modes by plotting the difference of the best per-word






Chong Wang, David M. Blei 



Assume the current state is $c$.

1. Choose two distinct tables, $(j_1,t_1)$ and $(j_2,t_2)$, uniformly at random.

2. Split case: if $k_{j_1 t_1} = k_{j_2 t_2} = k$, then

(a) Let $S_c$ be the set of tables whose topic is $k$, excluding tables $(j_1,t_1)$ and $(j_2,t_2)$ in state $c$, that is, $S_c = \{(j,t) : (j,t) \neq (j_1,t_1),\ (j,t) \neq (j_2,t_2),\ k_{jt} = k\}$.

(b) Assign $k_{j_1 t_1} = k_1$ and $k_{j_2 t_2} = k_2$. Randomly permute $S_c$, then run the sequential allocation restricted Gibbs algorithm to assign the tables in $S_c$ to $k_1$ or $k_2$ to obtain the split state $c_{\mathrm{split}}$. Calculate the product of the probabilities used as $q(c \to c_{\mathrm{split}})$.

(c) Calculate the acceptance ratio:

$$\mathcal{A} = \frac{p(c_{\mathrm{split}})\, L(c_{\mathrm{split}})}{p(c)\, L(c)} \cdot \frac{q(c_{\mathrm{split}} \to c)}{q(c \to c_{\mathrm{split}})},$$

where $q(c_{\mathrm{split}} \to c) = 1$, since there is only one way to merge two topics from $c_{\mathrm{split}}$ to $c$.

3. Merge case: if $(k_{j_1 t_1} = k_1) \neq (k_{j_2 t_2} = k_2)$,

(a) Let $S_c$ be the set of tables whose topics are either $k_1$ or $k_2$, excluding tables $(j_1,t_1)$ and $(j_2,t_2)$ in state $c$, that is, $S_c = \{(j,t) : (j,t) \neq (j_1,t_1),\ (j,t) \neq (j_2,t_2),\ k_{jt} = k_1 \text{ or } k_{jt} = k_2\}$.

(b) Randomly permute $S_c$, then run the sequential allocation restricted Gibbs algorithm to assign the tables in $S_c$ to $k_1$ or $k_2$ to reach the original split state $c$. Calculate the product of the probabilities used as $q(c_{\mathrm{merge}} \to c)$.

(c) Assign $k_{j_1 t_1} = k_{j_2 t_2} = k$ and $k_{jt} = k$ for $(j,t) \in S_c$ to obtain merge state $c_{\mathrm{merge}}$.

(d) Calculate the acceptance ratio:

$$\mathcal{A} = \frac{p(c_{\mathrm{merge}})\, L(c_{\mathrm{merge}})}{p(c)\, L(c)} \cdot \frac{q(c_{\mathrm{merge}} \to c)}{q(c \to c_{\mathrm{merge}})},$$

where $q(c \to c_{\mathrm{merge}}) = 1$, since there is only one way to merge two topics from $c$ to $c_{\mathrm{merge}}$.

4. Sample $u \sim \mathrm{Unif}(0, 1)$; if $u < \mathcal{A}$, accept the move; otherwise, reject it.

Fig. 2 The split-merge MCMC algorithm for the HDP. 
Fig. 3 The "word" and "topic" axes indicate the word and topic indexes. The different sizes of the dots indicate the relative probability values of the words in that topic.

log likelihood up to the same time (here time is the same as iteration), which is,

$$y_t = M_{t,\mathrm{Gibbs+SM}} - M_{t,\mathrm{Gibbs}}, \tag{8}$$

where $M_{t,\mathrm{Gibbs+SM}}$ and $M_{t,\mathrm{Gibbs}}$ indicate the modes found before time $t$, i.e., the best per-word log likelihoods for Gibbs sampling with split-merge and for pure Gibbs sampling up to time $t$. (This log likelihood is proportional in log space to the true posterior; a higher log likelihood indicates a state with higher posterior probability.) We found that split-merge explores the space to a better mode.
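The comparison statistic in Eq. 8 is a difference of running maxima. A minimal sketch, assuming each sampler records one per-word log likelihood per iteration (the function name and the example traces are made up for illustration):

```python
from itertools import accumulate

def best_so_far_difference(loglik_sm, loglik_gibbs):
    """Difference of the best log likelihood found so far by each sampler, per iteration."""
    best_sm = accumulate(loglik_sm, max)        # running maximum for Gibbs with split-merge
    best_gibbs = accumulate(loglik_gibbs, max)  # running maximum for pure Gibbs
    return [a - b for a, b in zip(best_sm, best_gibbs)]

# Hypothetical traces, for illustration only.
y = best_so_far_difference([-3.0, -2.0, -2.5], [-3.0, -3.5, -2.2])
print(y)
```

Values above zero at iteration t mean that the split-merge sampler has found a better mode by that time.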



In Figure 4(b), we compare the topic trace plots, which contain cumulative ratios of the words assigned to the most popular, two most popular, ..., to all topics. (This was adapted from [9].) Ideally, for our problem, these will be 0.2, 0.4, 0.6, 0.8 and 1.0. In this experiment, the traditional Gibbs sampler gets trapped with 4 topics while our algorithm finds 5 topics after several iterations.

In Figure 4(c) and (d), we visualize the topics obtained by each algorithm. Both methods identify the three easy topics (topics 3, 4 and 5 in the data), and these correspond to topics 2, 3 and 4 in Figure 4(c) and topics 1, 2 and 5 in Figure 4(d). However, traditional Gibbs sampling cannot identify the difference between topics 1 and 2 in the data; see topic 1 in Figure 4(c), which is a combination of the two true topics. In contrast, the split-merge algorithm distinguishes them; see topics 3 and 4 in Figure 4(d).

The split-merge algorithm introduces new Metropolis-Hastings
moves and, thus, is computationally more expensive than the 
traditional Gibbs sampling. However, we only split or merge 
at the top-level DP, where a table is treated as an observa- 
tion, and the number of tables is usually much smaller than 
the number of words. So, we expect the additional expense 
to be minimal. In the synthetic data, the difference was neg- 
ligible. 

Finally, in the experiments in [9] for DP mixtures, reverse split-merge moves (i.e., from state A to state B, then from state B back to state A) are frequent, while we do not see this behavior here. We hypothesize that this is because large moves, like a split or merge, are only accepted when the HDP reaches a much better local mode, and therefore the chance of making the reverse move is very small. Although running the Markov chain for a sufficiently long time might mitigate this issue, in practice we recommend running the split-merge operations in the burn-in phase. (This is also the strategy we use in the analysis on real data.)

Fig. 4 Experimental results on synthetic data (best viewed in color). The new algorithm is "Gibbs+SM". (a) The difference of the best per-word log likelihood up to the same time, where a value above zero indicates that Gibbs+SM does better. (b) Trace plot of the ratios, truncated at the 100th iteration. The remaining iterations are similar. Points are jittered for better viewing. (c) Topic visualization for "Gibbs" at its best mode. There are 4 topics indicated along the "topic" axis, also with different colors. (d) Topic visualization for "Gibbs+SM" at its best mode. There are 5 topics.



4.2 Analysis of Text Corpora 

We studied split-merge MCMC on three text corpora:

- ARXIV: This is a collection of 2000 abstracts (randomly sampled) from an online archive of research abstracts.⁵ The vocabulary has 2441 unique terms and the entire corpus contains around 89K words.
- ML+IR: This is a collection of 2080 conference abstracts downloaded from machine learning (ML) and information retrieval (IR) conferences, including CIKM, ICML, KDD, NIPS, SIGIR and WWW, from the years 2005-2008 [17].⁶ The vocabulary has 3237 unique terms and the corpus contains around 118K words.
- NIPS: This is a collection of 1392 abstracts, a subset of the NIPS articles published between 1988 and 1999.⁷ The vocabulary has 4368 unique terms and the entire corpus contains around 263K words.

5 http://arxiv.org

HDP analysis is unsupervised, so there is no ground truth. Thus we compare algorithms only by examining the modes, using the per-word log likelihood of the training set (80% of the data) and the per-word held-out log likelihood of the testing set (20% of the data). We use hyperparameters $\gamma$ and $\alpha_0$ with Gamma priors Gamma(1, 1). We let $\eta = 0.1, 0.2, 0.5$. In general a smaller $\eta$ leads to more topics, because the prior enforces that the topics are sparser. In addition, because we found in the synthetic experiments that it is difficult for split-merge operations to make reverse moves, we only run split-merge operations for the first 50 iterations, while the entire Gibbs sampler runs for 500 iterations. Since the data used here are much larger than the synthetic data, we compare the algorithms using computation time, rather than number of iterations. Figures 6 and 7 show the results. Here we focus on the difference of modes. For the log likelihood itself, see the example in Figure 5.

The experimental results are summarized as follows. Towards the end of the runs, Gibbs+SM and Gibbs are not too different. Gibbs+SM reaches a better mode faster with $\eta = 0.2$ and $\eta = 0.5$ for ML+IR and $\eta = 0.2$ for NIPS, while on other settings, Gibbs+SM was as good as Gibbs sampling.

We hypothesize that there are two reasons. First, as we see from the experiments on synthetic data, if topics are easy to separate, the traditional Gibbs sampler will find them easily as well. In Figure 8 we plot the topic similarities in different corpora for $\eta = 0.5$. We see that ARXIV and NIPS do not have many similar topics. For ML+IR, however, there are more similar topics. (This is also demonstrated in [?].) Gibbs+SM is more likely to work well in this scenario.

Second, for $\eta = 0.2$ and $\eta = 0.5$ in ML+IR, there are fewer topics (but not so few that they cannot be distinguished) than when $\eta = 0.1$ in the corpus. Each topic contains more tables, and thus the chance of picking two informative tables that can provide good guidance for the split-merge operation is higher. For NIPS, the reason it works for $\eta = 0.2$ is the same as for ML+IR; for $\eta = 0.5$, the corpus might need very few topics to explain itself, and thus the topics obtained are easy to separate.

In summary, on real data, Gibbs+SM is at least as good as Gibbs sampling and sometimes helps speed convergence. On average, split-merge operations are accepted around 3% of the time. In general, the split-merge operations improve performance when sets of similar topics exist in the corpus.

6 http://www.cs.princeton.edu/~chongw/data/6conf.
7 http://www.cs.utoronto.ca/~sroweis/nips

Fig. 5 The likelihood comparison for ARXIV and ML+IR ($\eta = 0.2$). Trends are similar for other settings.

Fig. 8 Plot of topic similarities. Each dot's x-axis position indicates the similarity value of one pair of topics using cosine similarity. (The y-axis is for jitter and does not carry meaning.) The ranking of having similar topics is ML+IR > NIPS > ARXIV.
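The topic-similarity diagnostic used in this analysis (cosine similarity between pairs of topics) can be computed directly from the topic-word distributions. The sketch below is illustrative only: the function names are hypothetical, each topic is assumed to be a list of word probabilities over a shared vocabulary, and the three toy topics are made up.

```python
import math
from itertools import combinations

def cosine_similarity(p, q):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(p, q))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_q = math.sqrt(sum(b * b for b in q))
    return dot / (norm_p * norm_q)

def pairwise_topic_similarities(topics):
    """Cosine similarity for every unordered pair of topics."""
    return [cosine_similarity(p, q) for p, q in combinations(topics, 2)]

# Hypothetical topics over a 4-word vocabulary: the first two overlap, the third does not.
topics = [[0.5, 0.5, 0.0, 0.0], [0.4, 0.6, 0.0, 0.0], [0.0, 0.0, 0.5, 0.5]]
sims = pairwise_topic_similarities(topics)
print(sims)
```

A corpus whose fitted topics produce many pairs with similarity near one is the setting in which split-merge moves are expected to help most.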



5 CONCLUSIONS AND FUTURE WORK 

We presented a split-merge MCMC algorithm for the HDP topic model. We showed on both synthetic and real data that the split-merge MCMC algorithm is effective during the burn-in phase of HDP Gibbs sampling. Further, we gave intuitions for what properties of the data lead to improved performance from split-merge MCMC.

Recently, Gibbs samplers based on the distance dependent Chinese restaurant process (ddCRP) [3] have demonstrated improved convergence for DP mixture models. Applying these ideas to the HDP is worth exploring.
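Appendix

We review Gibbs sampling for $\mathbf{t}$ [14]. Define the conditional density of $x_{ji}$ given all words in topic $k$ except $x_{ji}$,

$$f_k^{-x_{ji}}(x_{ji}) = \frac{\int f(x_{ji} \mid \phi_k) \prod_{j'i' \neq ji,\ z_{j'i'} = k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k}{\int \prod_{j'i' \neq ji,\ z_{j'i'} = k} f(x_{j'i'} \mid \phi_k)\, h(\phi_k)\, d\phi_k},$$

where $f(\cdot \mid \phi)$ is the Mult($\phi$) density and $h(\cdot)$ is the density of $H$, Dirichlet($\eta$). Since $h(\phi)$ is a Dirichlet density, $f_k^{-x_{ji}}(x_{ji})$ has the closed form

$$f_k^{-x_{ji}}(x_{ji} = v) = \frac{n_{\cdot\cdot k}^{-x_{ji},v} + \eta}{n_{\cdot\cdot k}^{-x_{ji}} + V\eta}.$$

Note that $f_k^{-x_{jt}}(x_{jt})$ and $f_k(\{x_{ji} : z_{ji} = k\})$ in the main text can be similarly derived.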

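The likelihood for $t_{ji} = t$, when $t = 1, \ldots, m_{j\cdot}$, is $f_{k_{jt}}^{-x_{ji}}(x_{ji})$, since table $t$ is linked to the topics through $k_{jt}$. For $t_{ji} = t^{\mathrm{new}}$, the likelihood is calculated by integrating out $G_0$,

$$p(x_{ji} \mid t_{ji} = t^{\mathrm{new}}, \mathbf{t}^{-ji}, \mathbf{k}) = \sum_{k=1}^{K} \frac{m_{\cdot k}}{m_{\cdot\cdot} + \gamma}\, f_k^{-x_{ji}}(x_{ji}) + \frac{\gamma}{m_{\cdot\cdot} + \gamma}\, f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji}),$$

where $f_{k^{\mathrm{new}}}^{-x_{ji}}(x_{ji}) = \int f(x_{ji} \mid \phi)\, h(\phi)\, d\phi$ is the prior density for $x_{ji}$. We have

$$p(t_{ji} = t \mid \mathbf{t}^{-ji}, \mathbf{k}) \propto n_{jt\cdot}^{-ji}\, f_{k_{jt}}^{-x_{ji}}(x_{ji}), \quad \text{if } t = 1, \ldots, m_{j\cdot},$$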


